43 research outputs found

    Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially.

    Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k0, the algorithms yield solutions whose approximation ratios are a mere additive term ε away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better quality solutions over the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.
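For orientation, the standard k-center objective is to choose k centers from S so that the maximum distance from any point to its nearest center is as small as possible. The sketch below shows the classic greedy farthest-first traversal, a well-known sequential 2-approximation for that objective. It is not the MapReduce or Streaming algorithm of the paper. The function name `farthest_first_centers`, the use of Euclidean distance, and the choice of the first point as the initial center are all assumptions made for illustration.

```python
import math


def farthest_first_centers(points, k):
    """Greedy farthest-first traversal (illustrative sequential baseline).

    points: list of equal-length coordinate tuples (Euclidean metric assumed).
    Returns (centers, radius), where radius is the largest distance from any
    point to its nearest chosen center.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must satisfy 0 < k <= len(points)")

    # Arbitrary first center (assumption: simply take the first point).
    centers = [points[0]]
    # dist[i] = distance from points[i] to its nearest center so far.
    dist = [math.dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point currently farthest from all centers.
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        # Update nearest-center distances with the new center.
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]

    return centers, max(dist)
```

Each round scans all points once, so the sketch takes O(k·|S|) distance computations and keeps one distance per point in memory.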

    Label-free detection of DNA single-base mismatches using a simple reflectance-based optical technique

    Rapid and quantitative detection of the binding of nucleic acids to surface-immobilized probes remains a challenge in many biomedical applications. We investigated the hybridization of a set of fully complementary and defected 12-base-long DNA oligomers by using the Reflective Phantom Interface (RPI), a recently developed multiplexed label-free detection technique. Based on the simple measurement of reflected light intensity, this technology makes it possible to quantify hybridization directly as it occurs on the surface, with a sensitivity of 10 pg mm⁻². We found a strong effect of single-base mismatches and of their location on hybridization kinetics and equilibrium binding. In line with previous studies, we found that DNA-DNA binding is weaker on a surface than in the bulk. Our data indicate that this effect is a consequence of weak nonspecific binding of the probes to the surface.

    Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees

    Mining time series motifs is a fundamental, yet expensive task in exploratory data analytics. In this paper, we therefore propose a fast method to find the top-k motifs with probabilistic guarantees. Our probabilistic approach is based on Locality Sensitive Hashing (LSH) and allows pruning most of the distance computations, leading to huge speedups. We improve on a straightforward application of LSH to time series data by developing a self-tuning algorithm that adapts to the data distribution. Furthermore, we include several optimizations to the algorithm, reducing redundant computations and leveraging the structure of time series data to speed up LSH computations. We prove the correctness of the algorithm and provide bounds on the cost of the basic operations it performs. An experimental evaluation shows that our algorithm is able to tackle time series of one billion points on a single CPU-based machine, performing orders of magnitude faster than the GPU-based state of the art.

    What’s New in Temporal Databases?

    Temporal databases have been an active research area for many decades, ranging from research work on query processing, most dominantly on selection and join queries, to new directions in models and semantics, such as temporal probabilistic or streaming data. At the same time, more database vendors have been integrating temporal features into their systems, most notably the temporal features of the SQL standard. In this paper, we summarize the latest research developments as presented in 30 research papers over the last five years in the context of temporal relational databases. We also describe the developments of industrial database systems and vendors.

    A General Coreset-Based Approach to Diversity Maximization under Matroid Constraints

    Diversity maximization is a fundamental problem in web search and data mining. For a given dataset S of n elements, the problem requires determining a subset of S containing k ≪ n "representatives" which maximize some diversity function expressed in terms of pairwise distances, where distance models dissimilarity. An important variant of the problem prescribes that the solution satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size k of a given matroid). While unconstrained diversity maximization admits efficient coreset-based strategies for several diversity functions, known approaches dealing with the additional matroid constraint apply only to one diversity function (sum of distances), and are based on an expensive, inherently sequential, local search over the entire input dataset. We devise the first coreset-based algorithms for diversity maximization under matroid constraints for various diversity functions, together with efficient sequential, MapReduce, and Streaming implementations. Technically, our algorithms rely on the construction of a small coreset, that is, a subset of S containing a feasible solution which is no more than a factor 1-ε away from the optimal solution for S. While our algorithms are fully general, for the partition and transversal matroids, if ε is a constant in (0,1) and S has bounded doubling dimension, the coreset size is independent of n and it is small enough to afford the execution of a slow sequential algorithm to extract a final, accurate solution in reasonable time. Extensive experiments show that our algorithms are accurate, fast, and scalable, and therefore they are capable of dealing with the large input instances typical of the big data scenario.

    Distributed graph diameter approximation

    We present an algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. In order to be efficient in terms of both time and space, our algorithm is based on a decomposition strategy which partitions the graph into disjoint clusters of bounded radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; most importantly, for a large family of graphs, it features a round complexity asymptotically smaller than the one exhibited by a natural approximation algorithm based on the state-of-the-art Δ-stepping SSSP algorithm, which is its only practical, linear-space competitor in the distributed setting. We complement our theoretical findings with a proof-of-concept experimental analysis on large benchmark graphs, which suggests that our algorithm may attain substantial improvements in terms of running time compared to the aforementioned competitor, while featuring, in practice, a similar approximation ratio.

    Tools to generate and check consistency of model classes for Java PathFinder

    Java PathFinder (JPF) is a model checker for Java applications. Like any other model checker, JPF has to combat the notorious state space explosion problem. Since JPF is a JVM, it can only model check Java bytecode and needs to handle native calls differently. JPF tackles the state space explosion problem and handles native calls by means of so-called model classes and native peers. In this paper we focus on model classes. For a class that either causes a state space explosion or that contains native calls, one can introduce a model class that either abstracts away particular details or implements the native call in Java. Rather than model checking the original class, JPF model checks the model class instead. Writing such model classes is time-consuming and error-prone. In this paper we propose two tools to assist with the development of model classes. One tool generates a skeleton of a model class. The other tool checks whether a model class is consistent with the original class.

    Efficient Computation of All-Window Length Correlations

    The interactive exploration of time series is an important task in data analysis. In this paper, we concentrate on the investigation of linear correlations between time series. Since the correlation of time series might change over time, we consider the analysis of all possible subsequences of two time series. Such an approach allows identifying, at different levels of window length, periods over which two time series correlate and periods over which they do not correlate. We provide a solution to compute the correlations over all window lengths in O(n²) time, which is the size of the output and hence the best we can achieve. Furthermore, we propose a visualization of the result in the form of a heatmap, which provides a compact overview of the structure of the correlations, suitable for a data analyst. An experimental evaluation shows that the tool is efficient enough to allow for interactive data exploration.
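One simple way to reach a quadratic-time bound of this kind is to precompute prefix sums, so that the Pearson correlation of any window can be read off in constant time. The sketch below illustrates that idea only; it is not claimed to be the method of the paper. The function name `all_window_correlations`, the `(start, length)` keying of the result, and the `min_len` parameter are assumptions made for illustration.

```python
import math


def all_window_correlations(x, y, min_len=2):
    """Pearson correlation of x and y over every aligned window.

    Returns a dict mapping (start, length) to the correlation of
    x[start:start+length] and y[start:start+length]. Windows in which
    either series is constant map to NaN.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("series must have equal length")

    # Prefix sums: entry i holds the sum over the first i elements.
    sx, sy, sxx, syy, sxy = [0.0], [0.0], [0.0], [0.0], [0.0]
    for a, b in zip(x, y):
        sx.append(sx[-1] + a)
        sy.append(sy[-1] + b)
        sxx.append(sxx[-1] + a * a)
        syy.append(syy[-1] + b * b)
        sxy.append(sxy[-1] + a * b)

    result = {}
    for start in range(n):
        for length in range(min_len, n - start + 1):
            end = start + length
            wx = sx[end] - sx[start]
            wy = sy[end] - sy[start]
            # Scaled covariance and variances of the window, each in O(1).
            cov = length * (sxy[end] - sxy[start]) - wx * wy
            var_x = length * (sxx[end] - sxx[start]) - wx * wx
            var_y = length * (syy[end] - syy[start]) - wy * wy
            if var_x <= 1e-12 or var_y <= 1e-12:
                result[(start, length)] = float("nan")
            else:
                result[(start, length)] = cov / math.sqrt(var_x * var_y)
    return result
```

The resulting dictionary has one entry per (start, length) pair, which is the triangular layout a heatmap would display. Prefix sums of squares can lose precision on long or large-valued series, so a production version would likely need a numerically safer formulation.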

    Industrial radioactive barite scale: suppression of radium uptake by introduction of competing ions

    Incorporation of radioactive isotopes during the formation of barite mineral scale is a widespread phenomenon occurring within the oil, mining and process industries. In a series of experiments radioactive barite/celestite solid solutions (SSBarite-Celestite) have been synthesized under controlled conditions by the counter diffusion of Ra-226, Ba²⁺, Sr²⁺ and SO₄²⁻ ions through a porous medium (silica gel), to investigate inhibiting effects in Ra uptake associated with the introduction of a competing ion (Sr²⁺). From characterization studies, the particle size and the morphology of the crystals appear to be related to the initial [Sr]/[Ba] molar ratio of the starting solution. Typically, systems richer in Sr produce smaller sized crystals and clusters characterized by a lower degree of order. The activity introduced to the system is mainly incorporated in the crystals generated from the barite/celestite solid solution as suggested by the activity profiles of the hydrogel columns analysed by gamma-spectrometry. There is a relationship between the initial [Sr]/[Ba] molar ratio of the starting solution and the activity exhibited by the synthesized crystals. An effective inhibition of the Ra-226 uptake during formation of the crystals (SSBarite-Celestite) was obtained through the introduction of a competing ion (Sr²⁺): the higher the initial [Sr]/[Ba] molar ratio of the starting solution, the lower the intensity of the activity peak in the crystals. © 2003 Published by Elsevier Ltd.